2014-03-01 Fundamentals of Spam Control

Introduction

I will discuss the SMTP protocol from the perspective of collecting information that may be used in the spam filtering decision. The aim is to reject as much spam as quickly and as easily as possible, so that the more time consuming and resource intensive filters are applied to fewer messages.

A word on some abbreviations. SMTP is the "Simple Mail Transfer Protocol", generally spoken on port 25 or 587, and defined in RFCs 821, 2821, and most recently 5321. An RFC is a Request for Comments, and the Internet RFCs generally define the interoperability requirements of the modern internet. See http://www.rfc-editor.org/ for others. A DNSBL is a dns based blacklist, generally a dns based mapping of either ip addresses or domain names to a binary listed/not-listed result. There are a LOT of different DNSBLs, all with different listing standards and target audiences.

Some spam filters are cheap (in both wall clock time and cpu effort), and others are much more expensive. For example, on a fairly well connected mail server, a typical DNSBL query for 1.2.3.4.zen.spamhaus.org takes about 8ms. On that same system, SpamAssassin takes from 1 to 5 seconds (wall clock time) per message. Clearly, we want to reject as much spam as we can with cheap mechanisms. The current behavior of spammers makes this fairly easy.
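The cheap-before-expensive idea can be sketched in a few lines of Python. This is only an illustration, not any particular mail server's API: the function name first_rejection and the (cost, name, check) tuple layout are assumptions made for this example.

```python
def first_rejection(message, filters):
    """Run filters from cheapest to most expensive; stop at the first rejection.

    filters: iterable of (estimated_cost_seconds, name, check), where
    check(message) returns True when the message should be rejected.
    Returns the name of the rejecting filter, or None if all filters pass.
    """
    for _cost, name, check in sorted(filters, key=lambda f: f[0]):
        if check(message):
            return name  # expensive filters further down the list never run
    return None
```

With a DNSBL check estimated at 0.008 seconds and a SpamAssassin check at 3 seconds, the SpamAssassin check only runs for messages that the DNSBL check did not reject.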

Our SMTP gateway is configured to accept mail for some set of email domains, and it may also be used as an outgoing relay for internal mail clients. One issue that we need to face is that not every internal email address will want or need the same set of spam filtering parameters. For example, an address like sales@example.com might want very little spam filtering. An address like ceo@example.com might need a much heavier hand on the spam filter lever.

Different email domains may need very different spam filtering parameters. Suppose company one.example.com is a local pizza shop that does no business in the far east, either with suppliers or customers. Such a company could simply reject email originating outside the US, and that particular filter would be very unlikely to reject anything that they actually wanted. An international company like IBM cannot do that, since a significant fraction of their workforce is based outside the US.

I will refer to a "filtering context" as the collection of all the available spam filtering parameters. Each email address or email domain can have its own independent filtering context containing its own values for these spam filtering parameters.

Any incoming email message has many attributes associated with it - IP addresses, domain names, etc. For many of these attributes we can use public (or private) databases that give us some measure of "evil" with respect to those attributes. We can then combine those measures of evil to make a spam control decision: accept, reject, quarantine, etc.

Measures of Evil

We have measures of evil with respect to IP addresses, with respect to domain names, and with respect to email body text. Primary measures are functions of the original input (ip address, domain name, etc). Secondary measures are functions of functions of the original input.

Primary measures

Given an IP address, we can ask various DNSBL lists if they consider that IP address to be evil in whatever sense that the particular DNSBL uses. For example, we can look up 1.2.3.4.zen.spamhaus.org to see if ZEN considers 4.3.2.1 to be evil. There are a LOT of such DNSBLs that we could use. We could use multiple such lists, and combine the results in any linear or non-linear way to form a decision about the evil measure of a particular IP address.
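As a minimal sketch in Python, the query name is the IPv4 octets reversed, followed by the DNSBL zone. The helper names below (dnsbl_query_name, is_listed, ip_evil_score) and the weights are assumptions for illustration. The sketch treats any A record as "listed", which is a simplification, since real lists may use specific return codes.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL query name for an IPv4 address: reversed octets + zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with any A record for this address.

    This uses the system resolver, which should be a local caching resolver.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False  # no such name (not listed) or the lookup failed


def ip_evil_score(ip: str, weights: dict, lookup=is_listed) -> float:
    """Linear combination: add up the weight of every zone that lists the address."""
    return sum(weight for zone, weight in weights.items() if lookup(ip, zone))
```

The lookup parameter lets a caller substitute a cached or non-linear combination without changing the name construction.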

Given a domain name, we can ask various lists if they consider that name to be evil. For example, given the name example.com, we can look up example.com.multi.surbl.org to see if SURBL has seen recent spam containing a reference to example.com, or if SURBL considers the name example.com to be evil in any other way.

Given an email body, we can ask systems like the DCC or SpamAssassin if they consider the email body to be evil, and we can combine those measures in arbitrary ways to form a decision about the evil measure of that email body text.

Secondary measures

Given an IP address, we can do a reverse dns lookup to convert it into a name, and then consider various primary measures of evil with respect to that name.

Given a domain name, we can do a dns A record lookup to convert it to an IP address (or a list of IP addresses), and then consider various measures of evil with respect to those IP addresses. We can do a dns NS record lookup on the original name to get more names. We can do a dns MX record lookup on the original name to get more names.

Given an email body, we can scan the body text for URLs or host names, and then use various primary or secondary measures of evil with respect to any or all of those names.

I should make a special mention here of such NS lookups. This particular method works well because

SMTP protocol commands

The SMTP protocol has a number of commands. We will describe some of these commands, and analyze the spam control possibilities given the information that is available at each command. The various commands are HELO, AUTH, MAIL FROM, RCPT TO, and DATA.

CONNECT phase

The SMTP client has made a TCP/IP connection to our mail server, and at this point we know the IP address of that client machine - actually the IP address of the far end of the tcp socket. We will probably be interested in various measures of evil for this IP address, but the exact choice of which measures we want will only be made later. So in general we will just remember this IP address and continue the SMTP conversation.

Note that if you have some intermediate proxy device between your mail server and the net, then you will only see the ip address of your proxy device here. Don't do that - you are throwing away valuable information.

Common errors - run an intermediate proxy so that your mail server sees every connection as coming from 192.168.0.x, and then check a DNSBL on every incoming connection to see if that 192.168.0.x (or the other perennial favorite 10.0.0.x) is listed on that DNSBL. Don't do that. For extra negative points, make these DNSBL queries directly from some non-caching software, so that every received mail hits the DNSBL dns servers with another redundant query.

HELO/EHLO

The next command is HELO. The SMTP client gives us a string which is supposed to be a fully qualified domain name that can be resolved to an IP address. We may be interested in various measures of evil for this name, but again the exact choice of which measures we want will only be made later. So in general we will just remember this name and continue the SMTP conversation.

We can also compare the HELO name given by this client with previous HELO names that we have recently seen from that same IP address. Any IP address that gives us a wide variety of different HELO names might be considered suspicious enough to be evil, even if none of the individual names are considered to be evil.
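One possible way to track HELO variety per client address is sketched below in Python. The class name HeloTracker, the one hour window, and the threshold of three distinct names are all assumptions chosen for the example, not recommended values.

```python
import time
from collections import defaultdict, deque


class HeloTracker:
    """Remember recent HELO names per client IP and flag unusual variety."""

    def __init__(self, window_seconds=3600.0, max_distinct=3):
        self.window_seconds = window_seconds
        self.max_distinct = max_distinct
        self._seen = defaultdict(deque)  # ip -> deque of (timestamp, helo name)

    def record(self, ip, helo, now=None):
        """Record one HELO; return True if this IP now looks suspicious."""
        now = time.time() if now is None else now
        history = self._seen[ip]
        history.append((now, helo.lower()))
        # Forget observations that have fallen out of the time window.
        while history and history[0][0] < now - self.window_seconds:
            history.popleft()
        distinct = {name for _, name in history}
        return len(distinct) > self.max_distinct
```

A real deployment would probably keep this history in shared storage so that all mail server processes see the same counts.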

AUTH

The client can optionally authenticate with a username/password. We will remember this, since it is useful for whitelisting our own direct clients' outgoing mail, and also for rate limiting our own clients' outgoing mail. In a sendmail milter, this is available in the {auth_authen} symbol in the MAIL FROM function.

MAIL FROM

The next command is MAIL FROM. The SMTP client gives us an email address that is supposedly the sender of this message. It is either the empty path <> in which case this is a bounce message, or it is a fully qualified email address of the form user@domain. We may be interested in various measures of evil with respect to this name, but again the exact choice of which measures we want will only be made later. So in general we will just remember this name and continue the SMTP conversation.

RCPT TO

The next command is RCPT TO. The SMTP client gives us an email address for one of the recipients of this message. Here is where things get interesting.

We use the recipient email address to select the filtering context for this recipient. That selection can use any information that we have up to this point in the SMTP conversation.

context = f(client ip, helo name, authentication, mail from, rcpt to)
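A minimal Python sketch of such a selection function follows. The field names, the lookup order (authenticated client, then exact address, then domain, then default), and the separate context for authenticated clients are assumptions for illustration; the function f above could be arbitrarily more elaborate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FilteringContext:
    name: str
    dnsbls: Tuple[str, ...] = ()              # zones used to reject at RCPT time
    sa_reject_score: Optional[float] = None   # None means no body scoring


@dataclass(frozen=True)
class Session:
    client_ip: str
    helo_name: str
    auth_user: Optional[str]   # None if the client did not authenticate
    mail_from: str


def select_context(session, rcpt_to, per_address, per_domain, default, authenticated):
    """Pick a filtering context from what is known at RCPT TO time."""
    if session.auth_user is not None:
        return authenticated              # our own clients sending outbound mail
    address = rcpt_to.lower()
    if address in per_address:
        return per_address[address]       # most specific match wins
    domain = address.rpartition("@")[2]
    return per_domain.get(domain, default)
```

This version only looks at the authentication state and the recipient, but the session object carries the client ip, HELO name, and MAIL FROM so that a richer rule set can use them.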

One of the main components of a filtering context is the selection of the set of DNSBLs to be used to reject mail to this recipient. Now that we have selected the filtering context, we can lookup the SMTP client ip address on those DNSBLs, and we might be able to reject this recipient with a message such as:

550 5.7.1 mail from 4.3.2.1 rejected - http://example.com?4.3.2.1
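A sketch of how that reply might be produced, in Python. The function name rcpt_reply, the injected is_listed callable, and the info_url default are assumptions for this example; the 250 reply text is likewise illustrative.

```python
def rcpt_reply(client_ip, dnsbls, is_listed, info_url="http://example.com"):
    """Return the SMTP reply for one recipient, given that recipient's DNSBL zones.

    is_listed(client_ip, zone) is supplied by the caller and returns True
    when the zone lists the address (ideally answered from a local cache).
    """
    for zone in dnsbls:
        if is_listed(client_ip, zone):
            # Permanent failure for this recipient only, with a URL for more detail.
            return f"550 5.7.1 mail from {client_ip} rejected - {info_url}?{client_ip}"
    return "250 2.1.5 Ok"
```

Because the list of zones comes from the recipient's filtering context, two recipients of the same message can get different answers for the same client ip address.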

Other components of the filtering context might include requirements such as:

At this point, we can process all of the filtering context components that don't need to see the mail body, and see if any of them will cause this recipient to be rejected.

Messages may have multiple recipients. In this case, we have multiple filtering contexts that should be applied independently to the same message. The components of a filtering context can be split into those components that are functions of information that is known at RCPT time, and those components that are functions of the message body (and possibly other data). It is easy to apply the first set of components independently to each recipient. But what do we do when a second or subsequent recipient uses a filtering context whose body content filtering conflicts with the body content filtering of the first recipient?

452 4.2.1 incompatible filtering contexts

If the second recipient requires their own body content filtering context, we can temp-fail that second recipient address, and the sender is responsible for resending it. However, many times, our recipients are willing to accept the body content filtering parameters of the first recipient.
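The temp-fail rule can be sketched in Python as below. The class name Transaction, the body_policy value (any comparable description of the body filtering settings), and the accepts_first flag are assumptions made for the example.

```python
class Transaction:
    """Track accepted recipients and the body filtering policy they share."""

    def __init__(self):
        self.body_policy = None
        self.recipients = []

    def rcpt(self, address, body_policy, accepts_first=False):
        """Return the SMTP reply for one RCPT TO within this message."""
        if not self.recipients:
            # The first accepted recipient fixes the body policy for the message.
            self.body_policy = body_policy
        elif body_policy != self.body_policy and not accepts_first:
            # Temp-fail: the sender should retry this recipient separately.
            return "452 4.2.1 incompatible filtering contexts"
        self.recipients.append(address)
        return "250 2.1.5 Ok"
```

A recipient with accepts_first set is one who is willing to accept the body content filtering parameters of the first recipient.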

Rate limiting - this is where we can apply rate limiting. If the client authenticated, then we have a username, and we can keep a history of the number of recipients that have been sent messages by that user over the past few minutes, hours, days, and reject the mail if some threshold has been exceeded. This at least mitigates the damage caused by an internal client getting infected, and sending bulk mail thru your corporate outbound mail server. And of course you can then use this detection to trigger other actions to firewall or contain that infected client machine.

If the client did not use SMTP AUTH authentication, you can still impose rate limits by client ip address, or by MAIL FROM address.
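A minimal sliding window rate limiter, sketched in Python. The key can be the authenticated username, the client ip address, or the MAIL FROM address. The class name and the example limits are assumptions, and this version keeps its history in memory only.

```python
import time
from collections import defaultdict, deque


class RecipientRateLimiter:
    """Limit recipients per key over several time windows at once."""

    def __init__(self, limits):
        # limits: list of (window_seconds, max_recipients), for example
        # [(60, 20), (3600, 200), (86400, 1000)]
        self.limits = sorted(limits)
        self._events = defaultdict(deque)  # key -> timestamps of accepted recipients

    def allow(self, key, now=None):
        """Return True and count the recipient, or False if a limit is reached."""
        now = time.time() if now is None else now
        events = self._events[key]
        longest_window = self.limits[-1][0]
        while events and events[0] <= now - longest_window:
            events.popleft()  # older than every window, no longer relevant
        for window, maximum in self.limits:
            recent = sum(1 for t in events if t > now - window)
            if recent >= maximum:
                return False
        events.append(now)
        return True
```

When allow returns False the server can reject the recipient, and the same event is a natural trigger for whatever containment action is wanted.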

DATA

The next command is DATA, and it is here that we receive the headers and body text of the message. Modern message bodies may contain MIME encoded parts, base64 coding, utf-8 or other character encodings, etc. During receipt of the message body, we can decode these various schemes to reconstruct a basic ascii text stream. We can then look for URLs or host names in that text stream, and look for various measures of evil with respect to those names.
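As a rough sketch, Python's standard email package can do the decoding, after which a simple pattern finds the URLs. The function name body_host_names and the URL pattern are assumptions for illustration; the pattern only finds http and https URLs and will miss bare host names.

```python
import re
from email import message_from_bytes, policy
from urllib.parse import urlsplit

# Deliberately simple pattern: http(s) URLs up to whitespace, quotes, or angle brackets.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def body_host_names(raw_message: bytes) -> set:
    """Decode every text part of a message and collect host names found in URLs."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    hosts = set()
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers, images, attachments
        try:
            text = part.get_content()  # undoes base64 / quoted-printable and charset
        except LookupError:
            continue  # unknown character set, nothing we can decode
        for url in URL_RE.findall(text):
            try:
                host = urlsplit(url).hostname
            except ValueError:
                continue  # malformed URL
            if host:
                hosts.add(host.lower())
    return hosts
```

Each returned name can then be checked against whichever name based lists the filtering context selects.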

We can also feed this message into systems like the DCC or SpamAssassin and ask those systems for their measure of evil with respect to this message.

Silent or noisy rejections

At the end of the DATA command, we process those components of the filtering contexts that need to see the mail body. Any recipients whose filtering contexts reject the message will be removed from the recipient list. If the resulting recipient list is empty, we can permanently reject the message here at the DATA stage.
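The end of DATA decision might look like the following Python sketch. It assumes a single body score per message and a per recipient threshold, with None meaning the recipient does no body filtering; the function name and the reply texts are hypothetical.

```python
def end_of_data(recipient_thresholds, score):
    """Decide the fate of a message once its body score is known.

    recipient_thresholds: dict of address -> reject threshold, or None for
    recipients that do not filter on body content.
    Returns (smtp_reply, list_of_recipients_to_deliver_to).
    """
    kept = [
        address
        for address, threshold in recipient_thresholds.items()
        if threshold is None or score <= threshold  # reject only above threshold
    ]
    if not kept:
        # Every recipient's context rejected the body: permanent failure.
        return "550 5.7.1 message rejected by content filter", []
    return "250 2.0.0 Ok", kept
```

Note that when the reply is 250 and some recipients were removed, those removals are not reported to the SMTP client.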

Note that we *can* tell the SMTP client that we have accepted the message, and yet fail to deliver it to some of the recipients. This is generally not a good idea. Our spam filtering may have made a mistake, and this is actually a message that was sent by a real person, and the recipient actually wanted this message. If we simply fail to deliver the message after telling the SMTP client that we have accepted it for delivery, there will be no bounce or reject message. This is a silent delivery failure, and these are generally difficult for our users to diagnose.

However, suppose the message is addressed to two recipients. The first recipient wants body content filtering via SpamAssassin, rejecting if the SA score is larger than 10. The second recipient wants to whitelist this message. We run the message thru SA and get a score of 20. Since we have a whitelisted recipient, we accept the message, silently drop delivery for the first recipient, and deliver the message to the second recipient's mailbox.

Alternatively, if our filtering contexts are flexible enough, we could have detected that possibility earlier, and temp-failed (4xy) the second recipient to force the sender to split the envelope. We would then not have to deal with such a silent failure here at the end of the DATA phase.

Spam filtering on internal machines

Suppose you have an external mail relay machine, that accepts all the mail and forwards it to an internal machine. Spam filtering on that internal machine has many problems, and not all of them have solutions.

The end result is that your spam filtering needs to be done on the external mail relay machine. And in this case, you need to arrange things so that that external machine has access to the list of valid users on your internal machine(s). There are many ways to do that, including LDAP, probing ahead with an SMTP transaction, periodic offline copying of the list, etc.

Other considerations

There are systems that don't deal well with 5xy rejections at the DATA stage. In particular, a number of commercial bulk mailers don't see that as a rejection, and they will continue to send messages (sometimes even that same message) to the same recipient.

We can essentially split the universe of SMTP clients into two groups. One group includes all the machines running real mail servers such as Sendmail, Qmail, IMail, Exchange, Notes and others of that sort. The other group includes all the rest, which are mostly virus infected machines running some spamware bot code. Now what happens when we issue a permanent failure 5xy SMTP reply to a real mail server? That machine (an SMTP client to us) should generate a bounce message, which is probably deliverable by that machine, since it probably accepted the original message from one of its own clients. What happens when we issue such a permanent failure 5xy SMTP reply to a virus infected bot machine? Nothing - in general such bots might notify their controller that they could not deliver the message, but that notification is not generally done over SMTP.